Building a Discourse-Annotated Dutch Text Corpus
Abstract
We are compiling a corpus of Dutch texts annotated with discourse structure and lexical cohesion, containing initially 80 texts from expository and persuasive genres. We are using this resource for corpus-based studies of discourse relations, discourse markers, cohesion, and genre differences. We are also exploring the possibilities of automatic text segmentation and semi-automatic discourse annotation. This paper discusses our design choices in text selection and segmentation and in the annotation of discourse structure and lexical cohesion.
Similar resources
KNACK-2002: a Richly Annotated Corpus of Dutch Written Text
In this paper, we introduce the annotated KNACK-2002 corpus of Dutch written text. The corpus features five different annotation layers, ranging from the annotation of morphological boundaries at the word level, over the annotation of part-of-speech tags and phrase chunks at the syntactic level to the annotation of named entities at the semantic level and coreferential relations at the discours...
Building an Annotated Corpus for Text Summarization and Question Answering
We describe ongoing work on semi-automatic corpus annotation, with the goal of answering “why” questions in a question answering system and constructing a coherent tree for text summarization. In this paper we present annotation schemas for identifying the discourse relations that hold between parts of a text, as well as the particular spans of text that are related via the discourse r...
Multi-Layer Discourse Annotation of a Dutch Text Corpus
We have compiled a corpus of 80 Dutch texts from expository and persuasive genres, which we annotated for rhetorical and genre-specific discourse structure, and lexical cohesion with the goal of creating a gold standard for further research. The annotations are based on a segmentation of the text in elementary discourse units that takes into account cues from syntax and punctuation. During the ...
Penn Discourse Treebank: Building a Large Scale Annotated Corpus Encoding DLTAG-based Discourse Structure and Discourse Relations
Large scale annotated corpora have played a critical role in speech and natural language research. However, while existing annotated corpora such as the Penn Treebank have been highly successful at the sentence-level, we also need large-scale annotated resources that reliably encode key aspects of discourse. In this paper, we detail (1) our plans for building the Penn Discourse Treebank (PDTB),...
Computational Analysis of Coherence Relations in Dutch
The NWO-programme Modelling textual organisation: coherence and cohesion studies the organisation of text into structural units by means of coherence (discourse relations between clausal and larger textual units) and cohesion (lexico-semantic relations between words in textual units). The programme is organised around two related PhD-projects, focussing on coherence and cohesion, respectively. ...